10 research outputs found

    An efficient multiple precision floating-point Multiply-Add Fused unit

    No full text
    Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple precision floating-point Multiply-Add Fused (MAF) unit. The proposed MAF is reconfigurable and able to execute a quadruple precision MAF instruction, or two double precision instructions, or four single precision instructions in parallel. The MAF architecture features a dual-path organization reducing the latency of the floating-point add (FADD) instruction and utilizes the minimum number of operating components to keep the area low. The proposed MAF design was implemented on a 65 nm silicon process achieving a maximum operating frequency of 293.5 MHz at 381 mW power. © 2015 Elsevier Ltd. All rights reserved

    A Novel Delta-Sigma Control System Processor and Its VLSI Implementation

    No full text
    This paper describes a novel control system processor architecture based on DeltaSigma modulation known as the DeltaSigma -CSP. The DeltaSigma -CSP utilizes 1-bit processing which is a new concept in digital control applications with the direct benefit of making multi-bit multiplication operations redundant. A simple conditional-negate-and-add (CNA) unit is instead used for operations in control law implementations. For this reason, the proposed processor has a very small silicon footprint and runs at very high frequencies making it ideal for high-sampling rate, real-time control applications. A number of DeltaSigma -CSP configurations have been implemented as VLSI hard macros in a high-performance 0.13-mum CMOS process and a particular configuration achieved a post-route operating frequency of 355 MHz resulting in a 2.17 MHz sampling rate for a fourth-order control law implementation. Additional results prove that the DeltaSigma -CSP compares very favorably, in terms of silicon area and sampling rates, to two other specialized digital control processing systems, including direct, hardwired implementation of control laws; at the same time, it substantially outperforms software implementations of control laws running on very wide, general-purpose VLIW architectures

    Fully systolic FFT architecture for giga-sample applications

    No full text
    We present a novel 4096 complex-point, fully systolic VLSI FFT architecture based on the combination of three consecutive radix-4 stages resulting in a 64-point FFT engine. The outcome of cascading these 64-point FFT engines is an improved architecture that efficiently processes large input data sets in real time. Using 64-point FFT engines reduces the buffering and the latency to one third of a fully unfolded radix-4 architecture, while the radix-4 schema simplifies the calculations within each engine. The proposed 4096 complex point architecture has been implemented on a FPGA achieving a post-route clock frequency of 200 MHz resulting in a sustained throughput of 4096 point/20.48 μs. It has also been implemented on a high performance 0.13 μm, 1P8M CMOS process achieving a worst-case (0.9 V, 125 C) post-route clock frequency of 604.5 MHz and a sustained throughput of 4096 point/3.89 μs while consuming 4.4 W. The architecture is extended to accomplish FFT computations of 16K, 64K and 256K complex points with 352, 256 and 188 MHz operating frequencies respectively. © 2009 Springer Science+Business Media, LLC

    Thread-parallel MPEG-2 and MPEG-4 encoders for shared-memory system-on-chip multiprocessors

    No full text
    This work focuses on speeding up MPEG-2 and MPEG-4 encoding by using thread parallelism for shared-memory, System-on-Chip (SoC) multiprocessors. Improving the performance of the MPEG encoders is shown by reducing the dynamic instruction count at multiple processor contexts and then mapping onto a configurable SoC multiprocessor. The resulting reduction in the dynamic instruction count of the parallelized MPEG-2 TM5 encoder for 32 processor contexts reaches a maximum of 95% and that of the MPEG-4 XViD a maximum of 83% for 16 processor contexts, both compared to the sequential encoder. To realize the parallelized encoders we present a configurable, N-way, extensible, bus-based, cache-coherent SoC multiprocessor, augmented with data-parallel coprocessors, and we give the VLSI implementation for the 2-way and 4-way configurations

    Customization of an embedded RISC CPU with SIMD extensions for video encoding: A case study

    No full text
    This work presents a detailed case study in customizing a configurable, extensible, 32-bit RISC processor with vector/SIMD instruction extensions for the efficient execution of block-based video-coding algorithms utilizing a proprietary co-design environment. In addition to the default Full-Search motion estimation of the MPEG-2 Test Model 5, fourteen fast ME algorithms were implemented in both scalar and vector form. Results demonstrate a reduction of up to 68% in the dynamic instruction count of the full search-based encoder whereas the fast motion estimation algorithms achieved a reduction in instruction count of nearly 90%, both accelerated via three 128-bit vector/SIMD instructions when compared to the scalar, reference implementation of the standard. We address in detail the profiling, vectorization and the development of these vector instruction set extensions, discuss in depth the implementation of a parametric vector accelerator that implements these instructions and show the introduction of that accelerator into a 32-bit RISC processor pipeline, in a closely-coupled configuration. © 2007 Elsevier B.V. All rights reserved
    corecore